Evaluating The Performance of Non-Blocking Synchronisation on Shared-Memory Multiprocessors

Authors

Philippas Tsigas

Yi Zhang

Abstract

Parallel programs running on shared-memory multiprocessors coordinate via shared data objects/structures. To ensure the consistency of the shared data structures, programs typically rely on some form of software synchronisation. Unfortunately, typical software synchronisation mechanisms usually result in poor performance because they produce large amounts of memory and interconnection-network contention and, more significantly, because they produce convoy effects that degrade performance significantly in multiprogramming environments: if one process holding a lock is preempted, other processes on different processors waiting for the lock will not be able to proceed. Researchers have introduced non-blocking synchronisation to address the above problems. Non-blocking implementations allow multiple tasks to access a shared object at the same time, but without enforcing mutual exclusion to accomplish this. However, its performance implications are not well understood on modern systems or on real applications. In this paper we study the impact of non-blocking synchronisation on parallel applications running on top of a modern […]-processor, cache-coherent, shared-memory multiprocessor system: the SGI Origin. Cache-coherent non-uniform memory access (ccNUMA) shared-memory multiprocessor systems have attracted considerable research and commercial interest in the last years. In addition to the performance results on a modern system, we also investigate the key synchronisation schemes that are used in multiprocessor applications and their efficient transformation to non-blocking ones.

Evaluating the impact of synchronisation performance on applications is important for several reasons. First, micro-benchmarks cannot capture every aspect of primitive performance. It is hard to predict a primitive's impact on application performance. For example, a lock or barrier that generates a lot of additional network traffic might have little impact on applications. Second, even in applications that spend significant time in synchronisation operations, the synchronisation time might be dominated by wait time due to load imbalance and lock serialisation in the application, which better implementations of synchronisation may not help to reduce. Third, micro-benchmarks rarely capture (generate) scenarios that occur in real applications.

We evaluated the benefits of non-blocking synchronisation in a range of applications running on top of a modern realization of shared-memory multiprocessors: a […]-processor SGI Origin. In this evaluation, (i) we used a big set of applications with different communication characteristics, making sure that we also included applications that do not spend a lot of time in synchronisation; (ii) we also modified all the lock-based synchronisation points of these applications when possible.

The goal of our work was to provide an in-depth understanding of how non-blocking synchronisation can improve the performance of modern parallel applications. More specifically, the main issues addressed in this paper include: (i) the architectural implications of ccNUMA on the design of non-blocking synchronisation; (ii) the identification of the basic locking operations that parallel programmers use in their applications; (iii) the efficient non-blocking implementation of these synchronisation operations; (iv) the experimental comparison of the lock-based and lock-free versions of the respective applications on a cache-coherent non-uniform memory access shared-memory multiprocessor system; (v) the identification of the structural differences between applications that benefit more from non-blocking synchronisation than others.

We chose to examine these issues on a […]-processor SGI Origin multiprocessor system. This machine is attractive for the study because it provides an aggressive communication architecture and support for both in-cache and at-memory synchronisation primitives. It should be clear, however, that the conclusions and the methods presented in this paper have general applicability in other realizations of cache-coherent non-uniform memory access machines. Our results can benefit parallel programmers in two ways: first, to understand the benefits of non-blocking synchronisation, and then to transform some typical lock-based synchronisation operations that are probably used in their programs to non-blocking ones by using the general translations that we provide in this paper.

Experiments and Main Results

The SGI Origin that we used has […] […] MHz MIPS R[…] CPUs with […] MB L[…] cache and […] GB main memory. We used a large group of applications, some of which are from the SPLASH suite and some of which were developed more recently and constitute the shared-memory part of the Spark kernels suite. More specifically, from SPLASH we used the following applications: (i) Ocean, (ii) Volrend, (iii) Radiosity, (iv) Water-Nsquared, (v) Water-Spatial. Because we wanted to make the evaluation on realistic problem sizes for these multiprocessors, we selected large problem sizes that do not favour synchronisation; still, as we will show later, the improvements were significant for most applications. Generally, the larger the problem size, the lower the frequency of synchronisation relative to computation. After studying the applications that we had selected, we identified the lock-based high-level synchronisation operations that they use. As a next step, we proposed a set of efficient lock-free implementations for these synchronisations. The description of the implementations is general enough to be used in other parallel applications. These implementations, together with the detailed modifications for each application, can be found in the full paper. The results from our experiments show the following.

For Ocean, there was no significant improvement after the modification, but non-blocking synchronisation does not hamper the performance of Ocean. Ocean is a regular application with very regular communication patterns, and below […] processors the synchronisation time does not contribute much to the total execution time. Because the Ocean application requires the number of processes to be a power of […], we could only do the experiments for up to […] processors.

For Radiosity, there was no big difference between the two versions (the lock-based one and the non-blocking one) until we reached […] processors, where synchronisation became a significant part of the total computing time. With […] processors the non-blocking version is about […] faster than the lock-based one, and as the number of processors increases the improvement in performance also increases, reaching a […] better performance when using […] processors, the maximum number of processors that we could use exclusively for running this application. The access patterns to shared data structures in Radiosity are highly irregular.

For Volrend, the performance advantages of non-blocking synchronisation start to show as the number of processors becomes greater than […]. The performance of the non-blocking version is close to optimal, since its speedup is very close to the theoretical limit. Volrend's inherent data referencing pattern on data that are written is migratory, while its induced pattern at page granularity involves multiple producers with multiple consumers.

For the Spark applications, due to the limited time for exclusive use that we had, we performed the experiments for up to […] processors. The results clearly show the power of non-blocking synchronisation for unstructured applications like this one. The speedup of the lock-based programs stops when we go above […] processors, while the non-blocking one continues to scale uniformly. This allows us to conjecture that non-blocking synchronisation will dramatically increase the performance of these applications as the number of processors increases.

In Water-Nsquared and Water-Spatial, the communication patterns and the sharing of the data are very simple. A process updates a local copy of the particle accelerations as it computes them and accumulates into the shared copy once at the end. This simple communication pattern does not give lock-free synchronisation the opportunity to show its power. On the other hand, the experiments show that lock-free synchronisation does not harm the performance of the applications. The lock-free versions of both applications perform as well as the respective lock-based ones.

To conclude: (i) For the fairly wide range of applications examined, non-blocking synchronisation performs as well as, and often better than, the respective blocking synchronisation. (ii) For certain applications, the use of non-blocking synchronisation yields great performance improvement. Figure […] describes graphically the maximum speedup of the lock-free and the respective lock-based implementation for each of our implementations. With […] processors the non-blocking version of Radiosity is about two times faster than the lock-based one; non-blocking Volrend is about […] times faster than the lock-based one. Irregular applications benefit the most from non-blocking synchronisation. Since the importance of such applications is likely to increase in the future, the importance of lock-free synchronisation in high-performance parallel systems is also expected to increase. (iii) The methods that we introduced to replace lock-based synchronisations are quite simple, and general enough to be used in many parallel applications.

This work is partially supported by (i) the national Swedish Real-Time Systems research initiative ARTES (www.artes.uu.se), supported by the Swedish Foundation for Strategic Research, and (ii) the Swedish Research Council for Engineering Sciences.
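The abstract describes the Water applications as accumulating a per-process local copy into a shared copy once at the end, and speaks of translating lock-based synchronisation operations into non-blocking ones. The authors' own translations are in the full paper and are not reproduced here. The following is only a minimal C++ sketch of the general idea, under stated assumptions: the names `LockedSum`, `LockFreeSum` and `run_workers` are hypothetical, and the lock-free variant uses a standard compare-and-swap retry loop on `std::atomic<double>`, which is lock-free only where the platform supports it (this can be checked with `is_lock_free()`).

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical lock-based accumulator. If the thread holding the mutex is
// preempted, every other thread calling add() has to wait for it.
struct LockedSum {
    std::mutex m;
    double value = 0.0;

    void add(double x) {
        std::lock_guard<std::mutex> guard(m);
        value += x;
    }
    double get() {
        std::lock_guard<std::mutex> guard(m);
        return value;
    }
};

// Hypothetical non-blocking counterpart built on compare-and-swap.
// A preempted thread never prevents the others from completing add().
struct LockFreeSum {
    std::atomic<double> value{0.0};

    void add(double x) {
        double seen = value.load(std::memory_order_relaxed);
        // On failure, compare_exchange_weak stores the current value into
        // `seen`, so the next attempt uses fresh data.
        while (!value.compare_exchange_weak(seen, seen + x,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed)) {
        }
    }
    double get() const { return value.load(std::memory_order_acquire); }
};

// Each worker accumulates into a private local copy and publishes it to the
// shared sum exactly once at the end, mirroring the pattern in the text.
template <typename Sum>
double run_workers(Sum& shared, int workers, int items_per_worker) {
    assert(workers > 0 && items_per_worker >= 0);
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&shared, items_per_worker] {
            double local = 0.0;
            for (int i = 0; i < items_per_worker; ++i) {
                local += 1.0;  // stand-in for a computed contribution
            }
            shared.add(local);  // single synchronised update per worker
        });
    }
    for (auto& t : pool) {
        t.join();
    }
    return shared.get();
}
```

With one shared update per worker, both variants synchronise so rarely that little difference should be expected, which is consistent with the observation above that this pattern gives lock-free synchronisation little opportunity to show its power.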

Similar resources

Evaluating The Performance of Non-Blocking Synchronisation on Modern Shared-Memory Multiprocessors

Parallel programs running on shared memory multiprocessors coordinate via shared data objects/structures. To ensure the consistency of the shared data structures, programs typically rely on some forms of software synchronisations. Unfortunately typical software synchronisation mechanisms usually result in poor performance because they produce large amounts of memory and interconnection network ...

Modeling and Performance Evaluation of Multi-Processors Organization with Shared Memories

This paper is primarily concerned with the theoretical evaluation of the performance of multiprocessor systems. A Markovian waiting-line model has been developed for various multiprocessor configurations with shared memory. The system is analysed at the request level rather than the job level.

Practical Considerations for Non-Blocking Concurrent

An important class of concurrent objects are those that are non-blocking, that is, whose operations are not contained within mutually exclusive critical sections. A non-blocking object can be accessed by many threads at a time, yet update protocols based on atomic Compare-And-Swap operations can be used to guarantee the object's consistency. In this paper we take a practical look at the Compare...

Relative Performance of Preemption-Safe Locking and Non-Blocking Synchronization on Multiprogrammed Shared Memory Multiprocessors

Most multiprocessors are multiprogrammed to achieve acceptable response time. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for concurrent, atomic update of shared data structures: (1) preemption-safe locking and (2) non-blocking (lock-free) algor...

Locality and False Sharing in Coherent-Cache Parallel Graph Reduction

Parallel graph reduction is a model for parallel program execution in which shared memory is used under a strict access regime with single assignment and blocking reads. We outline the design of an efficient and accurate multiprocessor simulation scheme and the results of a simulation study of the performance of a suite of benchmark programs operating under a cache coherency protocol that is rep...

Publication date: 2001
